perm filename LIB.PRO[1,JMC] blob sn#005297 filedate 1971-02-17 generic text, type T, neo UTF8
00100	                            PROPOSAL FOR
00200	
00300	            A COMPUTER SCIENCE LIBRARY ON THE LASER FILE
00400	
00500	                by John McCarthy, Stanford University
00600	
00700		The  ARPA  10↑12  bit  file provides the first opportunity to
00800	store a library in machine usable form at  reasonable  cost.  Namely,
00900	the  capital  cost of storage on the file is 10↑-6 dollars per bit; a
01000	100,000 word book is about  4x10↑6  bits,  so  the  capital  cost  of
01100	storing  it is $4.00.  This is considerably less than the cost of the
01200	book and the storage space for it in a conventional library.
01300	
01400		We envisage  a  library  containing  all  important  computer
01500	science  and technology books, journals and reports; this would total
01600	between 2000 and 10,000 of the above-mentioned 100,000 word volumes,
01700	so we estimate the capital cost of storage at between $8000 and
01800	$40,000. This adds nothing to the immediate expense of the project,
01900	since the file is already committed.
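
As a check on the arithmetic in the two paragraphs above, here is a minimal Python sketch. The constants are the proposal's own figures; the function and variable names are illustrative and not part of any existing system.

```python
# Figures from the proposal; all names here are illustrative.
FILE_CAPACITY_BITS = 10**12      # the ARPA file
BITS_PER_DOLLAR = 10**6          # capital cost of 10^-6 dollars per bit
BITS_PER_VOLUME = 4 * 10**6      # one 100,000 word book


def storage_cost_dollars(volumes):
    """Capital cost of storing `volumes` books of 100,000 words each."""
    return volumes * BITS_PER_VOLUME / BITS_PER_DOLLAR


def fraction_of_file(volumes):
    """Share of the 10^12 bit file that the library would occupy."""
    return volumes * BITS_PER_VOLUME / FILE_CAPACITY_BITS


for volumes in (1, 2000, 10_000):
    print(volumes, storage_cost_dollars(volumes), fraction_of_file(volumes))
```

On these figures even the 10,000 volume library would occupy about 4 percent of the file.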
02000	
02100		We  envisage  the  file  being read through display consoles.
02200	According to a survey at the ARPA IPT contractors' meeting, about 100
02300	suitable consoles are already in use or soon will be. This may be
02400	optimistic, as not all the consoles may be suitable for reading.
02500	Someone who is reading at 600 words per minute will use approximately
02600	400 bits/second of the network's communication capacity, which is
02700	1/125th of one channel. Naturally, this will be used in bursts of
02800	page length. (To transmit a 1000 word page will take 0.8 seconds.)
02900	Thus,  the  network's  capacity  will  not  be  strained  even in the
03000	unlikely event that the library turns on all the members of  all  the
03100	projects   to  reading  the  literature.  Browsing  and  the  use  of
03200	information  retrieval  programs  would  increase  the   data   rates
03300	required, but clearly experimental use cannot strain the network.
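
The data rates quoted above can be reproduced in a few lines. The sketch below assumes 40 bits per word (the 4x10↑6 bits per 100,000 word book used earlier) and a channel capacity of 50,000 bits/second; the latter is not stated in the text but is what the 1/125 and 0.8 second figures imply. The names are illustrative.

```python
BITS_PER_WORD = 40                 # 4x10^6 bits / 100,000 words
CHANNEL_BITS_PER_SECOND = 50_000   # assumption inferred from the 1/125 figure


def reading_rate_bps(words_per_minute):
    """Average bit rate consumed by one reader."""
    return words_per_minute * BITS_PER_WORD / 60


def page_transmit_seconds(words_per_page):
    """Time to send one page as a single burst over one channel."""
    return words_per_page * BITS_PER_WORD / CHANNEL_BITS_PER_SECOND


def channel_share(readers, words_per_minute=600):
    """Fraction of one channel used, on average, by `readers` readers."""
    return readers * reading_rate_bps(words_per_minute) / CHANNEL_BITS_PER_SECOND


print(reading_rate_bps(600))        # bits/second for one reader
print(page_transmit_seconds(1000))  # seconds per 1000 word page
print(channel_share(100))           # all 100 surveyed consoles reading at once
```

Under these assumptions, even if all 100 surveyed consoles were reading at once, the average load would stay below the capacity of a single channel.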
03400	
03500		A  substantial  library  of  reports  can  be created without
03600	worrying about copyright restrictions, but I think we can and  should
03700	try to get publishers' permission to include their material. The
03800	limited number of users will clearly not cost them many sales, and
03900	getting  the  material  in machine usable form will, in the long run,
04000	more than counterbalance this.   There is, however, much to  be  said
04100	for  negotiating  a royalty agreement based on the amount of usage of
04200	the material as  measured  by  the  system.   This  can  serve  as  a
04300	prototype for such agreements in the future.
04400	
04500		The  facility  should  be  regarded  as a library and not, in
04600	itself, as an information retrieval system.  Naturally,  computerised
04700	information  retrieval  systems  can  use  the library, and documents
04800	created by such work can be included in the library, but the  library
04900	itself  should  not  be  committed  to  any  particular  approach  to
05000	information retrieval. In the extreme, each document has only a
05100	number, and every other kind of lookup must be accomplished with the
05200	help of programs that use auxiliary documents such as catalogs,
05300	indexes and bibliographies. In practice, there would have to be at
05400	least one librarian organization supported by ARPA to assure
05500	minimum facilities.
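
To make the extreme design concrete, here is a minimal sketch in Python, assuming a simple in-memory store. Everything in it (the Library class, the tab-separated catalog format, the find_by_title program, the sample titles) is hypothetical. It only illustrates the separation between the library, which knows documents by number, and retrieval programs, which work through auxiliary documents.

```python
from typing import Dict, List


class Library:
    """Stores documents under sequential numbers and offers nothing else."""

    def __init__(self) -> None:
        self._documents: Dict[int, str] = {}
        self._next_number = 1

    def add(self, text: str) -> int:
        """File a document and return the number assigned to it."""
        number = self._next_number
        self._documents[number] = text
        self._next_number += 1
        return number

    def fetch(self, number: int) -> str:
        """Return the text of the document with the given number."""
        return self._documents[number]


def find_by_title(library: Library, catalog_number: int, title: str) -> List[int]:
    """A retrieval program that lives outside the library.

    It reads a catalog document (hypothetical format: one
    'number<TAB>title' entry per line) and returns matching numbers.
    """
    matches = []
    for line in library.fetch(catalog_number).splitlines():
        number, _, entry_title = line.partition("\t")
        if title.lower() in entry_title.lower():
            matches.append(int(number))
    return matches


# Hypothetical usage: the catalog is itself just another numbered document.
library = Library()
first = library.add("Text of a report on list processing ...")
second = library.add("Text of a report on time sharing ...")
catalog = library.add(
    f"{first}\tList Processing Report\n{second}\tTime Sharing Report"
)
print(find_by_title(library, catalog, "time sharing"))
```

Because the catalog is an ordinary document, a different group could file its own index under another number and write its own lookup program without any change to the library.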
05600	
05700		Defense  Department use for such systems goes, of course, far
05800	beyond the computer science area, but computer science and technology
05900	is  the  right place to start because the consoles exist, the network
06000	exists, and the propensity to use such a system exists.  We  envisage
06100	DoD  systems  coming along about two years after the computer science
06200	system is demonstrated. It might be  worthwhile,  however,  to  begin
06300	work  early  on  the cryptography certification required to allow the
06400	file to be used for classified material.
06500	
06600		The major problem with such a library is getting a large body
06700	of  material  in  computer  usable form.  According to Dan Forsyth of
06800	Information International, present key-punching  costs  run  $.75  to
06900	$1.00  per thousand characters.    At this rate, the cost of entering
07000	the proposed computer science and technology library would be between
07100	$900,000  and  $6,000,000.  He  estimates  that his company's optical
07200	character recognition system might allow a contract  costing  between
07300	$.10  and  $.25  per 1000 characters.  This comes to between $120,000
07400	and $1,500,000.  Presumably, these uncertainties in the size  of  the
07500	collection  required  and the costs of converting it could be reduced
07600	rather quickly.  It is scarcely necessary to point out that there are
07700	many  commercial  approaches to optical character recognition, but the
07800	capacity of these approaches to handle the required wide  variety  of
07900	fonts  has  to  be looked into. Some combination of R&D contracts and
08000	fixed price contracts is the most likely  way  of  getting  the  work
08100	done.
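
The conversion estimates above follow from the size of the library and the per-character rates. The sketch below reproduces them, assuming an average of 6 characters per word; that figure is not stated in the text but is the one that yields the quoted totals. Names are illustrative, and rates are held in cents to keep the arithmetic exact.

```python
WORDS_PER_VOLUME = 100_000
CHARS_PER_WORD = 6   # assumption: back-computed from the quoted totals


def conversion_cost_dollars(volumes, cents_per_1000_chars):
    """Cost of entering `volumes` books at a given per-character rate."""
    thousands_of_chars = volumes * WORDS_PER_VOLUME * CHARS_PER_WORD // 1000
    return thousands_of_chars * cents_per_1000_chars // 100


# (low estimate, high estimate): fewest volumes at the cheapest rate,
# most volumes at the dearest rate.
keypunch = (conversion_cost_dollars(2000, 75), conversion_cost_dollars(10_000, 100))
ocr = (conversion_cost_dollars(2000, 10), conversion_cost_dollars(10_000, 25))
print(keypunch)
print(ocr)
```

The wide ranges come from multiplying two uncertainties together: a factor of five in the number of volumes and the spread in the rate per thousand characters.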
08200	
08300		Much printed material is already prepared in machine readable
08400	form for the use of the printers.  It  will  take  quite  an  effort,
08500	however, to get all that material read into a computer.
08600	
08700		Considerable standardization effort will be required to
08800	devise a good system for representing text in various fonts, and
08900	illustrations, in computer memory, and to make a system for
09000	displaying them, printing hard copy and making microfiche.
09100	
09200		The file system  is  scheduled  to  be  operational  at  Ames
09300	Research Center in the spring of 1972. We believe that the costs
09400	could be estimated and contracts let during the summer of 1971, and
09500	that the first documents could go into the system, with programs for
09600	reading them available in TENEX, about September 1972.
09700	
09800		The Stanford  Artificial  Intelligence  Project  already  has
09900	facilities for keeping documents in the computer and displaying them,
10000	and these features are in use.  Engelbart's group at SRI has  similar
10100	experience   and  a  charter  that  is  perhaps  somewhat  closer  to
10200	maintaining a library. It is likely that Stanford AI  and  SRI  could
10300	collaborate in the early stages of estimating costs and determining a
10400	good way of carrying out the project.